Project 4 is due by 5pm on Tuesday, January 14. Please submit both a .Rmd and a .pdf file on Blackboard by the deadline and drop off a hard copy of the pdf file at 26 Prospect Avenue by 6pm on the due date. To find the drop-off cabinet, turn left after you enter the building to enter the lounge area; the file cabinet will be on your right. You can drop your report into the open slot of the cabinet labeled “SML 201 Homework”. Note that the building might be locked after 6pm and on weekends. You are also welcome to drop off the pdf copy in advance of the deadline.
No late projects will be accepted after the deadline. The University forbids instructors from granting extensions past Dean’s Date without the approval of a student’s college Dean or Director of Studies. Please contact your college Dean or Director of Studies directly before the project deadline if you need an extension.
This project can be completed in groups of up to 3 students. It is okay to work by yourself if you prefer. You are welcome to get help from the instructors (you can either ask questions on Piazza or talk to instructors in person during office hours); however, please do not post code or solutions in a public post on Piazza.
You are encouraged to get help from the instructors (either through Piazza or in person) if you need help to understand the definitions of the variables of the dataset or the procedure of the experiment.
When working in a group, it is your responsibility to make sure that you are satisfied with all parts of the report and that the submission is on time (e.g., we will not entertain arguments that deficiencies are the responsibility of other group members). We expect that you each work independently first and then compare your answers with each other once you all finish, or that you all work together, each on your own laptop. Failing to make contributions and then putting your name on a report will be considered a violation of the honor code. Please do not divide the work among group mates. Everyone in your group is responsible for being able to explain the solutions in the report.
For all parts of this problem set, you MUST use R commands to print the output as part of your R Markdown file. You are not permitted to find the answer in the R console and then copy and paste the answer into this document.
If you are completing this project in a group, please have only one person in your group turn in the .Rmd and .pdf files; the other group members should turn in the list of the people in your group in the Text Submission field on the submission page. This means that everyone should make a submission (either a file upload or a text submission), regardless of whether you are working in a group or not.
The physical pdf report that you drop off and the .pdf file that you submit on Blackboard should be identical. Modifying your report after the deadline could result in a penalty as severe as a zero for the assignment.
Please type your name(s) after “Digitally signed:” below the honor pledge to serve as digital signature(s). Put the pledge and your signature(s) at the beginning of each document that you turn in.
I pledge my honor that I have not violated the honor code when completing this assignment.
Digitally signed: Bill Haarlow & Weston Carpenter
In order to receive full credit, please give all your graphs sensible titles and axis labels, and adjust the values of all relevant graphical parameters so that your plots are informative. Also, all answers must be written in complete sentences.
(Hypothetical) Congrats! You have been hired as an intern at Redfin (Redfin.com) in Seattle. For your first project, your manager would like you to build a new model for Redfin to predict the sold prices of houses in the King County area of Washington state.
To see the prices predicted by Redfin’s current model, look at the number shown near the top of the web page of a listing; e.g., for this house (https://www.redfin.com/WA/Seattle/132-NE-95th-St-98115/unit-B108/home/2316), the Redfin estimate is $394,279 as of Dec. 9, 2019.
We will use a subset of the dataset house_data_cleaned.csv to build the model. The dataset contains sold prices for houses in King County (in Washington state), including Seattle, for transactions made between May 2014 and May 2015. We will use only a subset of the variables in kc_house_data.csv because we only want to include variables that have clear definitions and seem relevant to house prices. A description of the original dataset and the complete list of the variable definitions can be found here (https://www.kaggle.com/harlfoxem/housesalesprediction/data). The definitions for the variables we will use are listed below (see the website address provided above for the complete list of the variables).
(See the definition of BLDGGRADE at https://www5.kingcounty.gov/sdc/FGDCDocs/RESBLDG_EXTR_faq.htm.)
# Load required packages
library(ggplot2)
library(GGally)
Read the dataset subset_kc_house_data.csv into
R and name it house. house should
have 21605 rows and 15 columns.
# Reads the dataset into R
house = read.csv("/Users/billhaarlow/Desktop/SML201/Projects/Project4/house_data_cleaned.csv")
dim(house)
[1] 21605 15
str(house)
'data.frame': 21605 obs. of 15 variables:
$ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
$ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
head(house)
Below are the matrices of scatterplots (you will need to remove eval = F), colored by the values in waterfront, for price and most of the other variables in the dataset. The variables date and zipcode are not included in the plots.
Since we have limited time for this project I have investigated the
relationship between the month of sale and price and did
not see any patterns; therefore, we will not include the date-related
information in our model.
Also, note that including zipcode in a scatterplot
matrix will not be helpful since the graph will be too small to show the
relationship between price and zipcode; thus,
we will investigate the relationship between price and
zipcode separately later in the report.
ggpairs(house[, c(5:6, 8, 11:12, 2)], aes(color = as.factor(waterfront),
alpha = 0.3), upper = list(continuous = wrap("cor", size = 3.5)),
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.1))) +
theme(axis.text.x = element_text(angle = 45, size = 6))
ggpairs(house[, c(3:4, 7:8, 2)], aes(color = as.factor(waterfront),
alpha = 0.3), upper = list(continuous = wrap("cor", size = 3.5)),
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.1))) +
theme(axis.text.x = element_text(angle = 45, size = 6))
ggpairs(house[, c(8:10, 13:14, 2)], aes(color = as.factor(waterfront),
alpha = 0.3), upper = list(continuous = wrap("cor", size = 3.5)),
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.1))) +
theme(axis.text.x = element_text(angle = 45, size = 6))
Are there any pair(s) of x-variables with a correlation value greater than .8? Report the pair(s). In general, is it good to have x-variables that are highly correlated (i.e., with correlation close to -1 or 1) among themselves in your model? Explain why or why not.
Answer: The pair sqft_above and sqft_living has a correlation value of 0.877. In general, it is not good to have highly correlated x-variables in a model: collinearity (strong pairwise correlation between x-variables) means the variables carry overlapping information, which inflates the standard errors of the coefficient estimates and makes the individual coefficients very hard to interpret.
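As a sanity check, highly correlated pairs can also be flagged programmatically. A minimal sketch on made-up data (the same pattern would apply to the numeric columns of house):

```r
# Sketch: flag x-variable pairs with |correlation| > 0.8 (synthetic data)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.2)  # strongly related to x1
x3 <- rnorm(100)                 # unrelated noise
dat <- data.frame(x1, x2, x3)

cor.mat <- cor(dat)
cor.mat[upper.tri(cor.mat, diag = TRUE)] <- NA  # keep each pair only once
high.pairs <- which(abs(cor.mat) > 0.8, arr.ind = TRUE)
# report the offending pairs by variable name
cbind(rownames(cor.mat)[high.pairs[, 1]], colnames(cor.mat)[high.pairs[, 2]])
```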
For each of the variables grade and bedrooms, calculate the average house price for each unique value of the variable and make a scatterplot of the average house prices vs. the unique values of the variable; e.g., grade takes on the integer values 3 through 13, so your scatterplot for grade should have 11 points: the first point should have x-coordinate 3 and y-coordinate equal to the average price of all the houses with grade = 3; the second point should have x-coordinate 4 and y-coordinate equal to the mean price of all the houses with grade = 4, and so on. Please include the origin in the graph for grade. Do the patterns on the two scatterplots look linear?
avg.price.grade = with(house, tapply(X = price, INDEX = grade, FUN = mean))
avg.price.bedrooms = with(house, tapply(X = price, INDEX = bedrooms,
FUN = mean))
ggplot() + geom_point(aes(x = sort(unique(house$grade)), y = avg.price.grade)) +
scale_x_continuous(limits = c(0, 13), breaks = c(0:13)) + labs(y = "Average House Price ($)",
x = "Grade", title = "Average House Price Per Grade")
ggplot() + geom_point(aes(x = sort(unique(house$bedrooms)), y = avg.price.bedrooms)) +
scale_x_continuous(limits = c(0, 11), breaks = c(0:11)) + labs(y = "Average House Price ($)",
x = "Number of Bedrooms", title = "Average House Price Per Number of Bedrooms")
Answer: The patterns on the two scatterplots do not look linear.
True or False: According to the assumptions of linear models, if we take the average of the y-values that correspond to each given x-value, the averages should roughly form a straight line.
Answer: True.
Make side by side boxplots to compare the values in
price by zip code. Based on your boxplots, is it good to
include the dummy variables for some of the zip codes in your model?
Explain.
ggplot(data = house) + geom_boxplot(aes(x = as.factor(zipcode),
y = price, group = zipcode)) + labs(x = "Zip code", y = "Price ($)",
title = "Price Distributions by Zip Code") + coord_flip()
Answer: Based on our boxplots, it is a good idea to include the dummy variables for a few select zip codes. This is because a small number of price distributions, such as (but not limited to) zip codes 98039, 98004, and 98112 differ noticeably from most of the other price distributions.
Consider the following model (you will need to remove
eval=F to run the code below):
summary(lm(price ~ factor(zipcode), data = house))
Call:
lm(formula = price ~ factor(zipcode), data = house)
Residuals:
Min 1Q Median 3Q Max
-1373107 -126733 -36653 64205 6800605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 281195 14898 18.875 < 2e-16 ***
factor(zipcode)98002 -46911 24992 -1.877 0.060526 .
factor(zipcode)98003 12916 22541 0.573 0.566646
factor(zipcode)98004 1074732 21788 49.327 < 2e-16 ***
factor(zipcode)98005 528970 26436 20.009 < 2e-16 ***
factor(zipcode)98006 578490 19566 29.565 < 2e-16 ***
factor(zipcode)98007 335910 28111 11.949 < 2e-16 ***
factor(zipcode)98008 364312 22474 16.210 < 2e-16 ***
factor(zipcode)98010 142471 31988 4.454 8.47e-06 ***
factor(zipcode)98011 209157 25157 8.314 < 2e-16 ***
factor(zipcode)98014 174422 29464 5.920 3.27e-09 ***
factor(zipcode)98019 143594 25371 5.660 1.53e-08 ***
factor(zipcode)98022 34514 23756 1.453 0.146282
factor(zipcode)98023 5538 19558 0.283 0.777065
factor(zipcode)98024 304814 34979 8.714 < 2e-16 ***
factor(zipcode)98027 335796 20407 16.455 < 2e-16 ***
factor(zipcode)98028 181285 22474 8.066 7.61e-16 ***
factor(zipcode)98029 331459 21716 15.264 < 2e-16 ***
factor(zipcode)98030 14993 23129 0.648 0.516835
factor(zipcode)98031 19146 22704 0.843 0.399080
factor(zipcode)98032 -29899 29376 -1.018 0.308791
factor(zipcode)98033 522525 20185 25.887 < 2e-16 ***
factor(zipcode)98034 240458 19209 12.518 < 2e-16 ***
factor(zipcode)98038 85673 18914 4.529 5.95e-06 ***
factor(zipcode)98039 1879412 42714 44.000 < 2e-16 ***
factor(zipcode)98040 913035 22496 40.586 < 2e-16 ***
factor(zipcode)98042 30437 19188 1.586 0.112689
factor(zipcode)98045 158276 24177 6.547 6.02e-11 ***
factor(zipcode)98052 364037 19014 19.145 < 2e-16 ***
factor(zipcode)98053 395440 20501 19.289 < 2e-16 ***
factor(zipcode)98055 23067 22824 1.011 0.312189
factor(zipcode)98056 139696 20477 6.822 9.21e-12 ***
factor(zipcode)98058 72414 19951 3.630 0.000285 ***
factor(zipcode)98059 212358 19828 10.710 < 2e-16 ***
factor(zipcode)98065 247714 21938 11.292 < 2e-16 ***
factor(zipcode)98070 206285 30016 6.872 6.49e-12 ***
factor(zipcode)98072 288764 22704 12.719 < 2e-16 ***
factor(zipcode)98074 404411 20091 20.129 < 2e-16 ***
factor(zipcode)98075 509382 21098 24.143 < 2e-16 ***
factor(zipcode)98077 401580 25032 16.042 < 2e-16 ***
factor(zipcode)98092 53726 21219 2.532 0.011348 *
factor(zipcode)98102 618200 31502 19.624 < 2e-16 ***
factor(zipcode)98103 303633 18849 16.109 < 2e-16 ***
factor(zipcode)98105 581630 23913 24.322 < 2e-16 ***
factor(zipcode)98106 38386 21474 1.788 0.073858 .
factor(zipcode)98107 297858 22873 13.022 < 2e-16 ***
factor(zipcode)98108 74484 25549 2.915 0.003556 **
factor(zipcode)98109 598429 30936 19.344 < 2e-16 ***
factor(zipcode)98112 814304 22800 35.716 < 2e-16 ***
factor(zipcode)98115 338706 18958 17.866 < 2e-16 ***
factor(zipcode)98116 337439 21558 15.652 < 2e-16 ***
factor(zipcode)98117 295600 19153 15.433 < 2e-16 ***
factor(zipcode)98118 136443 19485 7.002 2.59e-12 ***
factor(zipcode)98119 568253 25640 22.163 < 2e-16 ***
factor(zipcode)98122 353165 22322 15.822 < 2e-16 ***
factor(zipcode)98125 188261 20430 9.215 < 2e-16 ***
factor(zipcode)98126 143512 21173 6.778 1.25e-11 ***
factor(zipcode)98133 105817 19608 5.397 6.86e-08 ***
factor(zipcode)98136 270494 22948 11.787 < 2e-16 ***
factor(zipcode)98144 313353 21344 14.681 < 2e-16 ***
factor(zipcode)98146 78288 22364 3.501 0.000465 ***
factor(zipcode)98148 3714 40344 0.092 0.926659
factor(zipcode)98155 142531 20040 7.112 1.18e-12 ***
factor(zipcode)98166 183037 23182 7.896 3.03e-15 ***
factor(zipcode)98168 -40866 22800 -1.792 0.073078 .
factor(zipcode)98177 394990 23155 17.058 < 2e-16 ***
factor(zipcode)98178 29418 22973 1.281 0.200376
factor(zipcode)98188 7884 28480 0.277 0.781931
factor(zipcode)98198 21684 22541 0.962 0.336079
factor(zipcode)98199 510626 21788 23.436 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 283100 on 21535 degrees of freedom
Multiple R-squared: 0.4074, Adjusted R-squared: 0.4055
F-statistic: 214.5 on 69 and 21535 DF, p-value: < 2.2e-16
What is the estimate for the y-intercept of the model? Interpret the meaning of the y-intercept. What is the coefficient estimate for the dummy variable for zip code 98004? Interpret this number as well.
Answer: The estimate for the y-intercept of the model is 281195. This is the estimated average price, in dollars, for houses in the baseline zip code 98001 (the level absorbed into the intercept). The coefficient estimate for the dummy variable for zip code 98004 is 1074732. This means that houses in zip code 98004 sell, on average, for 1,074,732 dollars more than houses in the baseline zip code 98001.
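This interpretation can be verified on a toy example: with a single factor x-variable, the intercept of lm() is the mean response for the baseline level, and each dummy coefficient is that level's mean minus the baseline mean. A sketch with made-up groups (not the housing data):

```r
# Sketch: intercept = mean of the baseline level; dummy coef = level mean - baseline mean
y <- c(10, 12, 14, 20, 22, 24)                 # made-up responses
grp <- factor(c("A", "A", "A", "B", "B", "B"))
fit <- lm(y ~ grp)
coef(fit)[["(Intercept)"]]  # mean of group A: 12
coef(fit)[["grpB"]]         # mean(B) - mean(A): 22 - 12 = 10
```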
As we saw in question 2 some of the variables do not have a linear
relationship with price. We will transform these variables
to make their relationship with price more linear in this
question.
Let \(X\) be the unique values in grade and let \(Y\) be the average house price for the corresponding \(X\)-value. Thus, the scatterplot that we made in question 2.b for the average house price and grade plots \(Y\) vs. \(X\). From the scatterplot we see that the relationship between \(Y\) and \(X\) is not linear and could be approximated with the equation
\[ Y \approx a \times X^{b} \]
for some constants \(a\) and \(b\).
Taking the log of both sides results in
\[ log(Y) \approx log(a) + b \times log(X) \]
Find out what the “best” value for \(b\) should be according to the data; i.e., find \(b\) such that \((log(Y_i) - log(a) - b \times log(X_i))^2\) is minimized on average over the 11 data points on the scatterplot.
lm(log(avg.price.grade) ~ log(sort(unique(house$grade))))
Call:
lm(formula = log(avg.price.grade) ~ log(sort(unique(house$grade))))
Coefficients:
(Intercept)
9.497
log(sort(unique(house$grade)))
1.948
Answer: The calculated \(b\) value of 1.948 is the “best” value for \(b\) according to the data.
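The approach can be sanity-checked on simulated data: if Y is generated as roughly X^b (here with an assumed toy value b = 2), regressing log(Y) on log(X) recovers b as the slope:

```r
# Sketch: recover the exponent b from a log-log regression (simulated data)
set.seed(42)
x <- 3:13                                         # mimic the range of grade
b.true <- 2                                       # assumed exponent for the toy example
y <- x^b.true * exp(rnorm(length(x), sd = 0.01))  # small multiplicative noise
fit <- lm(log(y) ~ log(x))
coef(fit)[["log(x)"]]  # estimated slope, close to 2
```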
Use the value that you found for \(b\) in part a and create the variable
trans.grade by transforming grade to \(trans.grade = (grade)^{b}\).
You can make the scatterplot that you made for the average house prices vs. the unique values in grade in question 2.d again, except now replacing the unique values of grade with those of trans.grade. You can then verify that the scatterplot now shows a more linear pattern. This step should improve the linearity between price and grade, but it will not make the relationship completely linear. You are not required to turn in the scatterplot for this part.
trans.grade = (house$grade)^1.948
Here, we will make a new data frame for all the variables that we will need for building the model.
We will create the new data frame mod.variables with all
the transformed variables and some of the original variables.
mod.variables should include price,
sqft_living, sqft_basement,
grade, bedrooms, bathrooms and
date from the data frame house plus the
following transformed variables:
f.waterfront: the factor version of house$waterfront;
f.floor: the factor version of house$floors;
f.cond: the factor version of house$condition;
f.renov: a factor vector in which each element is 1 if the corresponding element in house$yr_renovated does not equal zero, and 0 otherwise;
f.bdrm.less.eq.8: a factor vector in which each element is 1 if the corresponding element in house$bedrooms is less than or equal to 8, and 0 otherwise;
f.zipcode: the factor version of house$zipcode;
and also the transformed variable trans.grade: the vector trans.grade that you have already made.
mod.variables = with(data = house, data.frame(price = price, sqft_living = sqft_living,
sqft_basement = sqft_basement, grade = grade, bedrooms = bedrooms,
bathrooms = bathrooms, date = date, f.waterfront = as.factor(waterfront),
f.floor = as.factor(floors), f.cond = as.factor(condition),
f.renov = as.factor(as.numeric(yr_renovated != 0)), f.bdrm.less.eq.8 = as.factor(as.numeric(bedrooms <=
8)), f.zipcode = as.factor(zipcode), trans.grade = trans.grade))
Explain why based on one of the scatterplots that you made in
question 2.b, it makes sense to consider the variables
f.bdrm.less.eq.8 and f.bdrm.less.eq.8:bedrooms
in the model.
Answer: Based on the “Average House Price Per Number of Bedrooms”
scatterplot, it makes sense to include the variables
f.bdrm.less.eq.8 and f.bdrm.less.eq.8:bedrooms
in the model because the average house price data for houses with less
than or equal to 8 bedrooms is fairly linear, and the data for houses
with 9 or more bedrooms is also fairly linear, but if combined the best
fit line would be very inaccurate. Therefore, it makes sense to split
the data and have different terms for the <=8 bedroom data and >=9
bedroom data.
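The idea of separate slopes on either side of the cutoff can be illustrated on simulated data (with an assumed break at x = 8, mirroring f.bdrm.less.eq.8):

```r
# Sketch: a factor plus a factor:x interaction fits a different slope per group
set.seed(7)
x <- 1:12
y <- ifelse(x <= 8, 50 * x, 600 - 30 * x) + rnorm(12, sd = 5)
f <- factor(as.numeric(x <= 8))  # 1 if x <= 8, analogous to f.bdrm.less.eq.8
fit <- lm(y ~ f + f:x)           # separate slope for each level of f
coef(fit)                        # slopes near 50 (f = 1) and -30 (f = 0)
```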
For this part you just need to remove the eval=F
argument for each of the code chunks below and run the code; no action
is required from you other than that.
mod.variables covers the period from May 2014 to May
2015. We will prepare a vector date.format that we can use
to extract out the observations that correspond to transactions closed
in May 2015.
date.format = format(as.Date(mod.variables$date, "%Y%m%dT"), "%Y%m")
length(date.format)
[1] 21605
head(date.format)
[1] "201410" "201412" "201502" "201412" "201502" "201405"
dim(mod.variables)
[1] 21605 14
Then, the following lines will extract out all the observations with
transactions closed in May 2015. We will use these observations for our
out-of-time test set test2.
test2 = mod.variables[date.format %in% "201505", !(names(mod.variables) %in%
c("date"))]
dim(test2)
[1] 646 13
For the rest of the observations run the following code chunk; this
will split the remaining observations into 3 sets: 70% for the training
set (i.e., train), 15% for the validation set (i.e.,
val) and 15% for in-time test set (i.e.,
test1).
# Remove the `date` column from the data frame and keep the rows
# that do not correspond to May 2015
tmp = mod.variables[!date.format %in% "201505", !(names(mod.variables) %in%
c("date"))]
s = nrow(tmp)
s
[1] 20959
set.seed(12122019)
permu = sample(1:s) # shuffle the row indices
train = tmp[permu, ][1:round(s * 0.7), ]
# randomly select 70% of the rows in tmp to form the training
# set
dim(train)
[1] 14671 13
val = tmp[permu, ][(round(s * 0.7) + 1):round(s * 0.85), ]
# randomly select 15% of the rows in tmp to form the validation
# set
dim(val)
[1] 3144 13
test1 = tmp[permu, ][(round(s * 0.85) + 1):s, ]
# randomly select 15% of the rows in tmp to form the in-time
# test set
dim(test1)
[1] 3144 13
# check to see what variables are in the datasets
colnames(test2)
[1] "price" "sqft_living" "sqft_basement"
[4] "grade" "bedrooms" "bathrooms"
[7] "f.waterfront" "f.floor" "f.cond"
[10] "f.renov" "f.bdrm.less.eq.8" "f.zipcode"
[13] "trans.grade"
We are now ready to build our model! mod.variables has
been divided into 4 sets:
train: the 14671 by 13 training set
val: the 3144 by 13 validation set
test1: the 3144 by 13 in-time test set
test2: the 646 by 13 out-of-time test set
For the rest of the project we will pick the number of predictors in our champion model in two ways: with and without cross-validation.
Use the non-test set (i.e., the combination of the training and validation sets) for this part.
First, find the “best” models by considering only the predictors that
already exist in the original dataset (i.e., all the variables in
non-test set except f.bdrm.less.eq.8 and
trans.grade). Use BIC and Adjusted \(R^2\) as the criteria to evaluate the
performance of your “best” models. You should use both backward and
forward selection algorithms. Do backward and forward selection
algorithms produce similar (in terms of the values for BIC and Adjusted
\(R^2\)) results in this case?
non_test = rbind(train, val)
dim(non_test)
[1] 17815 13
library(leaps)
g = regsubsets(price ~ sqft_living + sqft_basement + grade + bedrooms +
bathrooms + f.waterfront + f.floor + f.cond + f.renov + f.zipcode,
data = non_test, method = "backward", nvmax = 85, really.big = T)
h = regsubsets(price ~ sqft_living + sqft_basement + grade + bedrooms +
bathrooms + f.waterfront + f.floor + f.cond + f.renov + f.zipcode,
data = non_test, method = "forward", nvmax = 85, really.big = T)
i = regsubsets(price ~ sqft_living + sqft_basement + grade + bedrooms +
bathrooms + f.waterfront + f.floor + f.cond + f.renov + f.zipcode +
f.bdrm.less.eq.8 + trans.grade + f.waterfront:sqft_living +
f.waterfront:bathrooms + f.bdrm.less.eq.8:bedrooms, data = non_test,
method = "backward", nvmax = 90, really.big = T)
j = regsubsets(price ~ sqft_living + sqft_basement + grade + bedrooms +
bathrooms + f.waterfront + f.floor + f.cond + f.renov + f.zipcode +
f.bdrm.less.eq.8 + trans.grade + f.waterfront:sqft_living +
f.waterfront:bathrooms + f.bdrm.less.eq.8:bedrooms, data = non_test,
method = "forward", nvmax = 90, really.big = T)
library(ggplot2)
ggplot() + geom_point(aes(x = c(1:85), y = summary(g)$bic, color = "backward"),
alpha = 0.4) + geom_point(aes(x = c(1:85), y = summary(h)$bic,
color = "forward"), alpha = 0.4) + geom_point(aes(x = c(1:90),
y = summary(i)$bic, color = "backward adjusted"), alpha = 0.4) +
geom_point(aes(x = c(1:90), y = summary(j)$bic, color = "forward adjusted"),
alpha = 0.4) + labs(title = "BIC for best model for each given number of predictors",
x = "Given number of predictors", y = "BIC")
ggplot() + geom_point(aes(x = c(1:85), y = summary(g)$adjr2, color = "backward"),
alpha = 0.4) + geom_point(aes(x = c(1:85), y = summary(h)$adjr2,
color = "forward"), alpha = 0.4) + geom_point(aes(x = c(1:90),
y = summary(i)$adjr2, color = "backward adjusted"), alpha = 0.4) +
geom_point(aes(x = c(1:90), y = summary(j)$adjr2, color = "forward adjusted"),
alpha = 0.4) + labs(title = "Adjusted R-square for best model for each given number of predictors",
x = "Given number of predictors", y = "Adjusted R-square")
mod1 = lm(price ~ sqft_living + sqft_basement + grade + bedrooms +
bathrooms + f.waterfront + f.floor + f.cond + f.renov + f.zipcode,
data = non_test)
summary(mod1)
mod2 = lm(price ~ sqft_living + sqft_basement + grade + bedrooms +
bathrooms + f.waterfront + f.floor + f.cond + f.renov + f.zipcode +
f.bdrm.less.eq.8 + trans.grade + f.waterfront:sqft_living +
f.waterfront:bathrooms + f.bdrm.less.eq.8:bedrooms, data = non_test)
summary(mod2)
Second, repeat everything in the first step above, but consider 5 additional variables for your models: f.bdrm.less.eq.8, trans.grade, and the interaction terms f.waterfront:sqft_living, f.waterfront:bathrooms and f.bdrm.less.eq.8:bedrooms. Again, use the backward and forward selection algorithms. Plot the BIC values together with the ones produced in the first step above to see how much improvement there is from considering the 5 additional variables. Similarly, plot the Adjusted \(R^2\) values together with the ones produced in the first step above. Since you are making comparisons here, please make sure that all the values being compared are on the same graph.
As you can see from the graphs, creating the right variables helps to improve your model significantly! A seasoned data scientist knows how to create good variables by doing exploratory analysis on the data like what we did earlier in the report.
Answer: The backward and forward selection algorithms produce similar BIC and Adjusted \(R^2\) graphs, but they are not identical.
What is the purpose of using measures, such as BIC or Adjusted R-square, to evaluate the performance of a model? Why don’t we just use RSS instead?
Answer: We do not use RSS alone because RSS never increases as predictors are added, so it would always favor the model with the most predictors and encourage overfitting, leaving a model that performs poorly on later testing and predictions. Measures such as BIC and Adjusted R-square penalize model complexity, which makes it possible to compare models with different numbers of predictors fairly.
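The point about RSS can be demonstrated directly: adding even a pure-noise predictor never increases RSS, while adjusted R-square is free to decrease. A sketch on simulated data:

```r
# Sketch: RSS never increases when a predictor is added, even a useless one
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
noise <- rnorm(n)  # a predictor unrelated to y
fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + noise)
sum(resid(fit1)^2) >= sum(resid(fit2)^2)  # TRUE, by construction of least squares
summary(fit1)$adj.r.squared  # adjusted R-square penalizes the extra predictor,
summary(fit2)$adj.r.squared  # so it need not improve for fit2
```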
Use cross-validation to select the number of predictors in your champion model. Since we saw in question 5.a that considering the 5 additional x-variables help improve the performance of the best models significantly, make sure that you consider the 5 additional x-variables together with the original variables.
g.train = regsubsets(price ~ sqft_living + sqft_basement + grade +
bedrooms + bathrooms + f.waterfront + f.floor + f.cond + f.renov +
f.zipcode + f.bdrm.less.eq.8 + trans.grade + f.waterfront:sqft_living +
f.waterfront:bathrooms + f.bdrm.less.eq.8:bedrooms, data = train,
method = "backward", nvmax = 90, really.big = T)
h.train = regsubsets(price ~ sqft_living + sqft_basement + grade +
bedrooms + bathrooms + f.waterfront + f.floor + f.cond + f.renov +
f.zipcode + f.bdrm.less.eq.8 + trans.grade + f.waterfront:sqft_living +
f.waterfront:bathrooms + f.bdrm.less.eq.8:bedrooms, data = train,
method = "forward", nvmax = 90, really.big = T)
# define a function to make prediction for an object that is the
# output of regsubsets()
predict.reg = function(object, newdata, id) {
form = as.formula(object$call[[2]]) # Extract the formula used when we called regsubsets()
mat = model.matrix(form, newdata) # Build the model matrix
coefi = coef(object, id = id) # Extract the coefficients of the id'th model
xvars = names(coefi) # Pull out the names of the predictors used in the ith model
as.vector(mat[, xvars] %*% coefi) # Make predictions using matrix multiplication
}
cal.mse.g = function(x) {
predicted.y = predict.reg(object = g.train, newdata = val, id = x)
return(mean((val$price - predicted.y)^2))
}
cal.mse.h = function(x) {
predicted.y = predict.reg(object = h.train, newdata = val, id = x)
return(mean((val$price - predicted.y)^2))
}
mse.g = sapply(1:90, FUN = cal.mse.g)
mse.h = sapply(1:90, FUN = cal.mse.h)
ggplot() + geom_point(aes(x = c(1:90), y = mse.g, color = "backward train"),
alpha = 0.4) + geom_point(aes(x = c(1:90), y = mse.h, color = "forward train"),
alpha = 0.4) + labs(title = "MSE v.s. the number of predictors in the \"best\" model",
x = "Number of predictors in the \"best\" model", y = "Mean squared error")
Is the number of predictors for the champion model that you found in part a the same as that you found with BIC and adjusted \(R^2\)? If the two numbers disagree use the one selected with cross-validation for the rest of the parts in this question.
Answer: The numbers of predictors for the “best” model selected with and without cross-validation both fall in the range of 50-60 predictors, but they do not agree exactly, so we use the number selected with cross-validation for the rest of the parts in this question.
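Once the validation MSEs are computed, picking the champion size is mechanical: take the minimizer, or (if a simpler model is preferred) the smallest size whose MSE is within some tolerance of the minimum. A generic sketch with made-up MSE values (not our actual ones):

```r
# Sketch: choose the model size with the smallest validation MSE
mse <- c(9.1, 7.4, 6.10, 6.05, 6.06, 6.07)  # hypothetical MSEs for sizes 1..6
k.best <- which.min(mse)                    # size 4
# optionally prefer fewer predictors if nearly as good (1% tolerance, an assumed choice)
k.simple <- min(which(mse <= min(mse) * 1.01))  # size 3
c(k.best, k.simple)
```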
Propose your champion model by printing out the coefficient estimates for your model with code. Justify your choice.
champion <- regsubsets(price ~ sqft_living + sqft_basement + grade +
bedrooms + bathrooms + f.waterfront + f.floor + f.cond + f.renov +
f.zipcode + f.bdrm.less.eq.8 + trans.grade + f.waterfront:sqft_living +
f.waterfront:bathrooms + f.bdrm.less.eq.8:bedrooms, data = non_test,
really.big = T, method = "backward", nvmax = 58)
coef(champion, id = 58)
(Intercept) sqft_living
1162292.67607 174.66971
sqft_basement grade
-44.38388 -368275.48335
bathrooms f.floor2
23012.56007 -44362.13580
f.floor3 f.cond4
-99238.25577 32831.36825
f.cond5 f.renov1
75658.31143 66808.38284
f.zipcode98004 f.zipcode98005
721862.43757 254113.52657
f.zipcode98006 f.zipcode98007
212525.71805 204101.40420
f.zipcode98008 f.zipcode98011
222632.77347 89007.76497
f.zipcode98023 f.zipcode98024
-73250.82439 128179.93981
f.zipcode98027 f.zipcode98028
120217.96867 96899.93882
f.zipcode98029 f.zipcode98033
180488.38017 310944.30027
f.zipcode98034 f.zipcode98039
158340.83297 1196652.52494
f.zipcode98040 f.zipcode98052
442205.53251 190602.67338
f.zipcode98053 f.zipcode98065
160199.69919 63823.28987
f.zipcode98072 f.zipcode98074
116341.93665 133606.85273
f.zipcode98075 f.zipcode98077
121147.04660 70431.31205
f.zipcode98092 f.zipcode98102
-66443.54965 492239.24391
f.zipcode98103 f.zipcode98105
315185.68245 433701.79963
f.zipcode98106 f.zipcode98107
83985.74730 317284.80320
f.zipcode98108 f.zipcode98109
83219.18713 497580.80234
f.zipcode98112 f.zipcode98115
588306.02480 309768.68572
f.zipcode98116 f.zipcode98117
282328.59066 288589.15281
f.zipcode98118 f.zipcode98119
127586.50976 486319.95850
f.zipcode98122 f.zipcode98125
319888.03140 175240.68976
f.zipcode98126 f.zipcode98133
165092.35078 129179.94293
f.zipcode98136 f.zipcode98144
240722.72965 257231.47857
f.zipcode98146 f.zipcode98155
72127.92764 104593.60845
f.zipcode98177 f.zipcode98199
223317.06946 361857.13914
trans.grade sqft_living:f.waterfront1
31248.34962 261.49674
bedrooms:f.bdrm.less.eq.81
-12397.63916
Justification: We chose this model because the validation MSE first plateaus at its minimum at around 58 predictors, and we chose backward selection because it produced visibly lower MSEs than forward selection.
Let \(k\) be the number of predictors that you chose for your champion model. Compare the best models with k predictors selected by the backward and forward algorithms. Do the algorithms pick the same set of predictors? Are there any predictors that are selected by the forward algorithm but not by the backward? List the names of these variables if there are any. Similarly, are there any predictors that are selected by the backward algorithm but not by the forward? List the variables too if there are any.
find.champion.b <- regsubsets(price ~ sqft_living + sqft_basement +
grade + bedrooms + bathrooms + f.waterfront + f.floor + f.cond +
f.renov + f.zipcode + f.bdrm.less.eq.8 + trans.grade + f.waterfront:sqft_living +
f.waterfront:bathrooms + f.bdrm.less.eq.8:bedrooms, data = non_test,
really.big = T, method = "backward", nvmax = 58)
find.champion.f <- regsubsets(price ~ sqft_living + sqft_basement +
grade + bedrooms + bathrooms + f.waterfront + f.floor + f.cond +
f.renov + f.zipcode + f.bdrm.less.eq.8 + trans.grade + f.waterfront:sqft_living +
f.waterfront:bathrooms + f.bdrm.less.eq.8:bedrooms, data = non_test,
really.big = T, method = "forward", nvmax = 58)
names.b = names(coef(find.champion.b, id = 58))
names.f = names(coef(find.champion.f, id = 58))
names.f[!(names.f %in% names.b)]
[1] "bedrooms" "f.floor2.5" "f.cond3"
[4] "f.zipcode98042"
names.b[!(names.b %in% names.f)]
[1] "bathrooms" "f.cond4" "f.renov1"
[4] "f.zipcode98065"
Answer: The algorithms do not pick the same set of predictors. There are 4 predictors selected by the forward algorithm but not by the backward, and likewise 4 predictors selected by the backward algorithm but not by the forward. See the output above for the corresponding lists of differing variables.
For your champion model plot the residual plot, the qq plot and the histogram for the distribution of the residuals. What are the assumptions on the distribution of the errors for a linear model? Do your plots suggest any possible violation of the assumptions? Explain.
predicted.y = predict.reg(object = champion, newdata = val, id = 58)
summary(predicted.y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
62994 321405 471345 545179 667792 5290385
summary(val$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
90000 325000 450000 547180 654000 5300000
res = val$price - predicted.y
plot(predicted.y, res)
abline(lm(res ~ predicted.y), col = "red")
qqnorm(res)
qqline(res, col = "red")
hist(res, freq = F, breaks = 30)
Answer: The assumptions on the errors for a linear model are that the errors are independent and that they follow a normal distribution with mean 0 and constant standard deviation \(\sigma\). Our plots do not suggest any clear violation of these assumptions: the histogram of the residuals appears roughly normal, the qq plot is fairly linear, and the plot of the residuals vs. the predicted y-values has a best-fit line very close to y = 0, suggesting that our model is sound. (Note: we checked on Piazza and determined from a post that checking the independence assumption is outside the scope of this course; what we have written here is what we believe to be the correct answer, but we could be mistaken.)
Estimate the mean squared errors for your champion model with the in-time and out-of-time test sets. Is the MSE estimated with the out-of-time test set bigger or smaller than the MSE estimated with the in-time test set? Is this result expected?
cal.mse.in.time = function(x) {
predicted.y = predict.reg(object = champion, newdata = test1,
id = x)
return(mean((test1$price - predicted.y)^2))
}
cal.mse.out.time = function(x) {
predicted.y = predict.reg(object = champion, newdata = test2,
id = x)
return(mean((test2$price - predicted.y)^2))
}
mse.out.time = sapply(1:58, FUN = cal.mse.out.time)
mse.in.time = sapply(1:58, FUN = cal.mse.in.time)
ggplot() + geom_point(aes(x = c(1:58), y = mse.in.time, color = "In-Time"),
alpha = 0.4) + geom_point(aes(x = c(1:58), y = mse.out.time,
color = "Out-Of-Time"), alpha = 0.4) + labs(title = "Testing the Champion Model with the In- & Out of Time Test Sets",
x = "Number of predictors in the champion model", y = "Mean squared error")
Answer: The MSE of the out-of-time test set is bigger than that of the in-time test set. This result is expected, because the champion model was fit on data from the same time period as the in-time test set, while the out-of-time test set consists of later (May 2015) transactions that the model never saw.